Strategies for Cleaning Organizational Emails with an Application to Enron Email Dataset

Authors

  • Yingjie Zhou
  • Mark Goldberg
  • Malik Magdon-Ismail
  • William A. Wallace

Abstract

Archived organizational email datasets are considered valuable data resources for a range of studies, such as spam detection, email classification, Social Network Analysis (SNA), and text mining. Like other forms of raw data, email data can be messy and needs to be cleaned before any analysis is conducted. However, few studies have investigated the cleaning of archived organizational emails. This paper examines the properties of organizational emails and the difficulties faced in the cleaning process. Cleaning strategies are then proposed to solve the identified problems. The strategies are applied to the Enron email dataset.

Contact: Yingjie Zhou, Dept. of Decision Sciences and Engineering Systems, Rensselaer Polytechnic Institute, Troy, NY 12180. Tel: 1-518-276-8457. Fax: 1-518-276-8227. Email: [email protected]

Similar resources

Detecting Unusual and Deceptive Communication in Email

Deception theory suggests that deceptive writing is characterized by reduced frequency of first-person pronouns and exclusive words, and elevated frequency of negative emotion words and action verbs. We apply this model of deception to the Enron email dataset, and then apply singular value decomposition to elicit the correlation structure between emails. This allows us to rank emails by how wel...

Detecting unusual email communication

Deception theory suggests that deceptive writing is characterized by reduced frequency of first-person pronouns and exclusive words, and elevated frequency of negative emotion words and action verbs. We apply this model of deception to the Enron email dataset, and then apply singular value decomposition to elicit the correlation structure between emails. Those emails that have high scores using ...

Learning User Embeddings from Emails

Many important email-related tasks, such as email classification or search, rely heavily on building quality document representations (e.g., bag-of-words or key phrases) to assist matching and understanding. Despite prior success in representing textual messages, creating quality user representations from emails has been overlooked. In this paper, we propose to represent users using embeddings that a...

Email Classification Using Machine Learning Algorithms

Email has become one of the most frequently used forms of communication. Everyone has at least one email account. The inflow of spam messages is a major problem faced by email users. Many spam filtering techniques currently exist, but as these techniques emerged, spammers improved their methods of spamming. An effective spam filtering technique is therefore a timely requirement. In this paper...

Inferring Formal Titles in Organizational Email Archives

In the social network of large groups of people, such as companies and organizations, formal hierarchies with titles and lines of authority are established to define the responsibilities and order of power within that group. Although this information may be readily available for individuals within that group, the context this hierarchy provides in communications is not available to those outsid...

Publication date: 2007